This notebook outlines my process for building tree-based and neural network models. It depends on the data table gameInfo generated by DataExtraction.RMD.

Packages

library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
  method         from
  print.tbl_lazy     
  print.tbl_sql      
-- Attaching packages --------------------------------------------------------------------------------- tidyverse 1.3.1 --
v ggplot2 3.3.5     v purrr   0.3.4
v tibble  3.1.2     v dplyr   1.0.7
v tidyr   1.1.3     v stringr 1.4.0
v readr   1.4.0     v forcats 0.5.1
-- Conflicts ------------------------------------------------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
Warning message:
In read_python_versions_from_registry("HCU", key = "PythonCore") :
  Unexpected format for PythonCore version: 3.10
library(data.table)
data.table 1.14.0 using 4 threads (see ?getDTthreads).  Latest news: r-datatable.com

Attaching package: ‘data.table’

The following objects are masked from ‘package:dplyr’:

    between, first, last

The following object is masked from ‘package:purrr’:

    transpose
library(randomForest)
Warning: package ‘randomForest’ was built under R version 4.1.2
randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.

Attaching package: ‘randomForest’

The following object is masked from ‘package:dplyr’:

    combine

The following object is masked from ‘package:ggplot2’:

    margin
library(rpart.plot)
Warning: package ‘rpart.plot’ was built under R version 4.1.2
Loading required package: rpart
Warning: package ‘rpart’ was built under R version 4.1.2
library(word2vec)
Warning: package ‘word2vec’ was built under R version 4.1.2
library(Rtsne)
Warning: package ‘Rtsne’ was built under R version 4.1.2
library(plotly)
Warning: package ‘plotly’ was built under R version 4.1.2

Attaching package: ‘plotly’

The following object is masked from ‘package:ggplot2’:

    last_plot

The following object is masked from ‘package:stats’:

    filter

The following object is masked from ‘package:graphics’:

    layout
library(keras)
Warning: package ‘keras’ was built under R version 4.1.2
library(tfruns)
Warning: package ‘tfruns’ was built under R version 4.1.2
library(rsample)
Warning: package ‘rsample’ was built under R version 4.1.2

Loading Data from other part

load("../data/league.RDATA")

List to Store Results

data.tree <- list(
  models = list(),
  plots = list(),
  temp.data = list()
)
championCluster <- list(
  models = list(),
  plots = list(),
  temp.data = list()
)

Wrangling Data

I want to build a basic tree classifier that predicts the winning team from the two team comps. For now, the model uses only simple champion tag counts per team.

Setting up Training / Test Data

# Setting Seed for Reproducibility
set.seed(3)
data.tree$temp.data$sample <- sample(data.tree$temp.data$gameInfo.tree$match, nrow(data.tree$temp.data$gameInfo.tree)*.7)
data.tree$temp.data$train <- data.tree$temp.data$gameInfo.tree %>% 
  filter(match %in% data.tree$temp.data$sample)
data.tree$temp.data$test <- data.tree$temp.data$gameInfo.tree %>% 
  filter(!match %in% data.tree$temp.data$sample)

Generating Random Forest

set.seed(3)
data.tree$models$teamComp_forest <- randomForest(
  team_win ~ . - match,
  data = data.tree$temp.data$train,
  ntree = 500,
  importance = TRUE,
  na.action = na.omit
)

data.tree$models$teamComp_forest

Call:
 randomForest(formula = team_win ~ . - match, data = data.tree$temp.data$train,      ntree = 500, importance = TRUE, na.action = na.omit) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 3

        OOB estimate of  error rate: 50.08%
Confusion matrix:
      1     2 class.error
1 11396 12530   0.5236981
2 11377 12438   0.4777241
importance(data.tree$models$teamComp_forest)
                    1          2 MeanDecreaseAccuracy MeanDecreaseGini
Assassin_1  9.9540136  -6.814106            5.1561345         332.2679
Fighter_1  17.2205799 -13.858278            5.5311721         346.1510
Marksman_1  9.5225581  -8.116976            1.7123134         293.4874
Tank_1      7.4799838  -4.724412            3.9559451         285.2230
Mage_1     16.9019553 -14.924233            2.9464378         284.1115
Support_1  12.2424356 -11.952428            1.8623555         242.4686
Assassin_2  0.3371864  -2.088303           -2.3073435         359.3653
Fighter_2   2.4827423  -4.897604           -2.7091299         425.1212
Marksman_2  3.2506393  -4.942703           -1.7623174         369.9343
Tank_2      3.0022087  -5.144555           -2.4517398         338.7178
Mage_2      1.0833946  -1.633843           -0.6288478         389.8527
Support_2   0.6855808  -5.485852           -5.9403592         288.7111
varImpPlot(data.tree$models$teamComp_forest)

Let’s compare to a naive classifier that always predicts a blue-side win:

data.tree$temp.data$gameInfo.tree %>% 
  count(team_win) %>% 
  mutate(n = n/sum(n))

Well, judging by the OOB error rate of 50.08%, the forest is about on par with the naive blue-side classifier at best, so the number of champions with each tag clearly isn’t a strong predictor of team success. With the current coding, I’m fairly certain that there won’t be a robust classifier.

Let’s try to identify clusters of champion types.

Generating Input Team Sentences

Generating Model

The map shows five main clusters of champions, each corresponding to a role. That doesn’t help much in determining team compositions. I could set up a KNN to verify this, but it seems pretty clear-cut to me.

Neural Network

Wrangle Data

data.NN <- list()
data.NN$data.temp <- championCluster$temp.data$teams %>% 
  select(!match)
  
data.NN$data.temp

Running Model - See TeamCompNN.R

Hyperparameter Tuning

runs <- tuning_run(
  "TeamCompNN.R",
  flags = list(
    dropout = c(0.2, 0.3, 0.4, 0.5),
    unit = c(8, 16, 64)
  )
)

runs %>% 
  arrange(desc(metric_val_accuracy))
# So a dropout of 0.3 and an 8-unit dense network seems to produce the best validation accuracy
results
     loss  accuracy 
0.6919282 0.5214252 

Around 52% accuracy: not the best, but not bad considering the variance of League of Legends.

Saving Model

save_model_tf(model, "initialNN.tf")
2021-12-21 18:54:50.959358: W tensorflow/python/util/util.cc:368] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.

Evaluating Example Team

model %>% predict("Sett Trundle Kindred Ziggs Leona")
          [,1]
[1,] 0.5169869

This is a very weird way to code a team comp predictor, so I’ll try a different method in Part 3.

---
title: "Trees and Neural Networks"
output: html_notebook
---

This notebook outlines my process for building tree-based and neural network models. It depends on the data table gameInfo generated by DataExtraction.RMD.

# Packages
```{r}
library(tidyverse)
library(data.table)
library(randomForest)
library(rpart.plot)
library(word2vec)
library(Rtsne)
library(plotly)
library(keras)
library(tfruns)
library(rsample)
```

# Loading Data from other part
```{r}
load("../data/league.RDATA")
```

# List to Store Results
```{r}
data.tree <- list(
  models = list(),
  plots = list(),
  temp.data = list()
)
championCluster <- list(
  models = list(),
  plots = list(),
  temp.data = list()
)
```


# Wrangling Data
I want to build a basic tree classifier that predicts the winning team from the two team comps. For now, the model uses only simple champion tag counts per team.
```{r}
data.tree$temp.data$gameInfo.temp <- gameInfo %>% 
  left_join(
    champions.scraped,
    by = c("championName" = "name")
  ) %>% 
  group_by(match) %>% 
  mutate(
    team = rleid(win)
  ) %>% 
  ungroup()

data.tree$temp.data$gameInfo.tags <- data.tree$temp.data$gameInfo.temp %>% 
  group_by(match, team) %>% 
  count(tag) %>% 
  ungroup() %>% 
  pivot_wider(
    names_from = tag,
    values_from = n
  ) %>% 
  replace(is.na(.), 0) 


data.tree$temp.data$gameInfo.tree <- data.tree$temp.data$gameInfo.temp %>% 
  filter(win == TRUE) %>% 
  select(match, team_win = team) %>% 
  distinct(match, .keep_all = TRUE) %>% 
  mutate(
    team_win = factor(team_win, levels = c(1, 2))
  ) %>% 
  left_join(
    data.tree$temp.data$gameInfo.tags %>% 
      filter(team == 1) %>% 
      rename_with(
        # Suffix the six tag columns so team 1 and team 2 stay distinct after the join
        .fn = ~ paste0(.x, "_1"),
        .cols = 3:8
      ) %>% 
      select(!team),
    by = "match"
  ) %>% 
  left_join(
    data.tree$temp.data$gameInfo.tags %>% 
      filter(team == 2) %>% 
      rename_with(
        # Suffix the six tag columns so team 1 and team 2 stay distinct after the join
        .fn = ~ paste0(.x, "_2"),
        .cols = 3:8
      ) %>% 
      select(!team),
    by = "match"
  ) %>% 
  mutate_if(is.integer, as.factor)

data.tree$temp.data$gameInfo.tree
```
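
The `team` column above comes from `rleid(win)`, which numbers consecutive runs of the same value within each match. Below is a minimal base-R sketch of that idea on a hypothetical ten-row match, assuming the rows arrive grouped by team. The `run_id` helper and the toy `win` vector are illustrative stand-ins, not part of the notebook.

```r
# Hypothetical match: five rows for the first-listed team, five for the second.
win <- c(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE)

# Base-R stand-in for data.table::rleid(): number each run of equal values.
run_id <- function(x) {
  runs <- rle(x)
  rep(seq_along(runs$lengths), times = runs$lengths)
}

team <- run_id(win)
print(team)

# The winning side is whichever run id sits on the rows where win is TRUE.
team_win <- unique(team[win])
print(team_win)
```

Because the id depends on row order rather than on the value of `win`, team 1 is always the first-listed team. If the rows of a match were interleaved, the run ids would go past 2, so the ordering assumption is worth checking on the real data.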

# Setting up Training / Test Data
```{r}
# Setting Seed for Reproducibility
set.seed(3)
# Next time use rsample 
data.tree$temp.data$sample <- sample(data.tree$temp.data$gameInfo.tree$match, nrow(data.tree$temp.data$gameInfo.tree)*.7)
data.tree$temp.data$train <- data.tree$temp.data$gameInfo.tree %>% 
  filter(match %in% data.tree$temp.data$sample)
data.tree$temp.data$test <- data.tree$temp.data$gameInfo.tree %>% 
  filter(!match %in% data.tree$temp.data$sample)
```
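
The comment above suggests rsample for the next iteration. Here is a hedged sketch of what that could look like, using a toy stand-in for `gameInfo.tree` (the `toy_games` data frame and its columns are made up for illustration). The `strata` argument keeps the win ratio similar in both pieces.

```r
library(rsample)

set.seed(3)

# Toy stand-in for gameInfo.tree: one row per match with the winning team.
toy_games <- data.frame(
  match    = sprintf("match_%03d", 1:200),
  team_win = factor(rep(c(1, 2), each = 100), levels = c(1, 2)),
  Tank_1   = sample(0:3, 200, replace = TRUE)
)

# 70/30 split, stratified on the outcome.
split <- initial_split(toy_games, prop = 0.7, strata = team_win)
train <- training(split)
test  <- testing(split)

# Every match should land in exactly one of the two sets.
nrow(train) + nrow(test) == nrow(toy_games)
length(intersect(train$match, test$match)) == 0
```

This replaces the manual `sample()` plus two `filter()` calls with a single split object, which also avoids passing a non-integer size to `sample()`.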

# Generating Random Forest
```{r}
set.seed(3)
data.tree$models$teamComp_forest <- randomForest(
  team_win ~ . - match,
  data = data.tree$temp.data$train,
  ntree = 500,
  importance = TRUE,
  na.action = na.omit
)

data.tree$models$teamComp_forest
```
```{r}
importance(data.tree$models$teamComp_forest)
varImpPlot(data.tree$models$teamComp_forest)
```
Let's compare to a naive classifier that always predicts a blue-side win:
```{r}
data.tree$temp.data$gameInfo.tree %>% 
  count(team_win) %>% 
  mutate(n = n/sum(n))
```
Well, judging by the OOB error rate of 50.08%, the forest is about on par with the naive blue-side classifier at best, so the number of champions with each tag clearly isn't a strong predictor of team success. With the current coding, I'm fairly certain that there won't be a robust classifier.
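
The forest's printed confusion matrix can be turned into an accuracy figure and set against the naive baseline on the same training rows. The sketch below uses the OOB counts from the random forest output and assumes team 1 is the blue side, which the notebook does not state explicitly.

```r
# OOB confusion matrix as printed by the forest (rows = actual, cols = predicted).
oob <- matrix(
  c(11396, 12530,
    11377, 12438),
  nrow = 2, byrow = TRUE,
  dimnames = list(actual = c("1", "2"), predicted = c("1", "2"))
)

# Share of training matches the forest gets right out-of-bag.
oob_accuracy <- sum(diag(oob)) / sum(oob)

# Accuracy of always predicting team 1 on the same matches.
naive_accuracy <- sum(oob["1", ]) / sum(oob)

round(c(forest = oob_accuracy, naive = naive_accuracy), 4)
```

On these counts the two figures are within a fraction of a percentage point of each other. A comparison on the held-out `test` set would be the more reliable check.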

Let's try to identify clusters of champion types.

# Generating Input Team Sentences
```{r}
championCluster$temp.data$teams <- gameInfo %>% 
  select(match, win, championName) %>% 
  group_by(match, win) %>% 
  mutate(championNumber = row_number()) %>% 
  pivot_wider(
    names_from = championNumber,
    values_from = championName
  ) %>% 
  transmute(match = match, win = win, team = str_c(`1`,`2`,`3`,`4`,`5`, sep = " ")) %>% 
  ungroup() 

championCluster$temp.data$teams
write_csv(championCluster$temp.data$teams, "../data/teamNames.csv")
```
# Generating Model
```{r}
set.seed(3)
championCluster$models$nlpModel <- word2vec(
  x = championCluster$temp.data$teams$team, 
  type = "skip-gram", 
  dim = 20, 
  iter = 15
)

# Embedding Matrix
championCluster$models$embeddingMatrix <- as.matrix(championCluster$models$nlpModel)

# Applying TSne 
championCluster$models$Tsne <- Rtsne(championCluster$models$embeddingMatrix, pca = FALSE)

championCluster$plots$map <- championCluster$models$Tsne$Y %>% 
  as.data.frame() %>%
  mutate(champion = row.names(championCluster$models$embeddingMatrix)) %>%
  ggplot(aes(x = V1, y = V2, label = champion)) + 
  geom_point() 

championCluster$plots$map <- championCluster$plots$map %>% 
  ggplotly()

championCluster$plots$map 
```
The map shows five main clusters of champions, each corresponding to a role. That doesn't help much in determining team compositions. I could set up a KNN to verify this, but it seems pretty clear-cut to me.
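
One way to check the cluster count without eyeballing the plot is to cluster the embedding directly. The sketch below swaps in k-means (unsupervised) for the KNN mentioned above, since it needs no role labels. It runs on simulated 20-dimensional points rather than the real `embeddingMatrix`, and the five simulated groups are an assumption for illustration only.

```r
set.seed(3)

# Simulated stand-in for the embedding matrix: 5 groups x 30 points, 20 dims.
n_groups  <- 5
per_group <- 30
centers   <- matrix(rnorm(n_groups * 20, sd = 5), nrow = n_groups)
truth     <- rep(seq_len(n_groups), each = per_group)
toy_embedding <- centers[truth, ] + matrix(rnorm(length(truth) * 20), ncol = 20)

# k-means with several random starts to avoid a poor local optimum.
fit <- kmeans(toy_embedding, centers = n_groups, nstart = 25)

# Cross-tabulate simulated groups against recovered clusters.
table(truth, cluster = fit$cluster)

# Total within-cluster sum of squares for k = 1..8; an elbow hints at the cluster count.
wss <- sapply(1:8, function(k) kmeans(toy_embedding, centers = k, nstart = 25)$tot.withinss)
round(wss)
```

On the real data, `toy_embedding` would be replaced by `championCluster$models$embeddingMatrix`, and the cross-tabulation could be run against each champion's known role if that label is available.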

# Neural Network
## Wrangle Data
```{r}
data.NN <- list()
data.NN$data.temp <- championCluster$temp.data$teams %>% 
  select(!match)
  
data.NN$data.temp
```

# Running Model - See TeamCompNN.R
## Hyperparameter Tuning
```{r}
runs <- tuning_run(
  "TeamCompNN.R",
  flags = list(
    dropout = c(0.2, 0.3, 0.4, 0.5),
    unit = c(8, 16, 64)
  )
)

runs %>% 
  arrange(desc(metric_val_accuracy))
# So a dropout of 0.3 and an 8-unit dense network seems to produce the best validation accuracy
```
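
With no `sample` argument, `tuning_run()` trains one run per combination of flag values. A small base-R sketch of the grid implied by the flags above:

```r
# Every combination of the two flags passed to tuning_run().
grid <- expand.grid(
  dropout = c(0.2, 0.3, 0.4, 0.5),
  unit    = c(8, 16, 64)
)

# Number of training runs in a full grid search.
nrow(grid)
grid
```

If the full grid gets expensive, `tuning_run()` also accepts a `sample` fraction to train only a random subset of the combinations.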

```{r}
results
```
Around 52% accuracy: not the best, but not bad considering the variance of League of Legends.
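
For context, a model that always predicts 0.5 scores a loss of log(2), about 0.693, under binary cross-entropy. The quick check below compares that with the reported figures, assuming the reported loss is binary cross-entropy (the notebook does not show how `TeamCompNN.R` compiles the model).

```r
# Loss and accuracy as reported in the results output.
reported <- c(loss = 0.6919282, accuracy = 0.5214252)

# Binary cross-entropy of a coin-flip model that always outputs 0.5.
chance_loss <- -log(0.5)

# How far the network sits from the coin-flip reference points.
c(
  loss_gain     = chance_loss - reported[["loss"]],
  accuracy_gain = reported[["accuracy"]] - 0.5
)
```

Both gaps are small, which fits the reading that team names alone carry only a weak signal.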

# Saving Model
```{r}
save_model_tf(model, "initialNN.tf")
```
```{r include = F}
model <- load_model_tf("./initialNN.tf")
```


# Evaluating Example Team
```{r}
model %>% predict("Sett Trundle Kindred Ziggs Leona")
```
This is a very weird way to code a team comp predictor, so I'll try a different method in Part 3.
